Introduction

This notebook presents an analysis of data on 17007 strategy games available on the Apple App Store, such as Clash of Clans, Plants vs Zombies, Pokemon GO and others. The dataset was acquired from Kaggle.com, and it was collected on the 3rd of August 2019 using the iTunes API.

With this dataset, we may be able to analyze which factors make a game successful.

Loading the dataset

To start this analysis, we first load the required packages (tidyverse, readr, DT) and read the CSV file provided by Kaggle.

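# Attach each package; if it is missing, install it and then attach it.
if(!require(tidyverse)){install.packages("tidyverse"); library(tidyverse)}
if(!require(readr)){install.packages("readr"); library(readr)}
if(!require(DT)){install.packages("DT"); library(DT)}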
options(scipen=10000)

appstoreGamesFile = "data/appstore_games.csv"
appstoreGamesDF = read_csv(appstoreGamesFile) %>% rename_all(~str_replace_all(., "\\s+", ""))
summary(appstoreGamesDF)
##      URL                  ID                 Name          
##  Length:17007       Min.   : 284921427   Length:17007      
##  Class :character   1st Qu.: 899654330   Class :character  
##  Mode  :character   Median :1112286228   Mode  :character  
##                     Mean   :1059613815                     
##                     3rd Qu.:1286982837                     
##                     Max.   :1475076711                     
##                                                            
##    Subtitle           IconURL          AverageUserRating UserRatingCount  
##  Length:17007       Length:17007       Min.   :1.000     Min.   :      5  
##  Class :character   Class :character   1st Qu.:3.500     1st Qu.:     12  
##  Mode  :character   Mode  :character   Median :4.500     Median :     46  
##                                        Mean   :4.061     Mean   :   3306  
##                                        3rd Qu.:4.500     3rd Qu.:    309  
##                                        Max.   :5.000     Max.   :3032734  
##                                        NA's   :9446      NA's   :9446     
##      Price          In-appPurchases    Description       
##  Min.   :  0.0000   Length:17007       Length:17007      
##  1st Qu.:  0.0000   Class :character   Class :character  
##  Median :  0.0000   Mode  :character   Mode  :character  
##  Mean   :  0.8134                                        
##  3rd Qu.:  0.0000                                        
##  Max.   :179.9900                                        
##  NA's   :24                                              
##   Developer          AgeRating          Languages        
##  Length:17007       Length:17007       Length:17007      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##       Size            PrimaryGenre          Genres         
##  Min.   :     51328   Length:17007       Length:17007      
##  1st Qu.:  22950144   Class :character   Class :character  
##  Median :  56768954   Mode  :character   Mode  :character  
##  Mean   : 115706430                                        
##  3rd Qu.: 133027072                                        
##  Max.   :4005591040                                        
##  NA's   :1                                                 
##  OriginalReleaseDate CurrentVersionReleaseDate
##  Length:17007        Length:17007             
##  Class :character    Class :character         
##  Mode  :character    Mode  :character         
##                                               
##                                               
##                                               
## 

As the summary shows, there are 18 columns in this dataset.

We need to fix the types of some columns: the two release dates were read as character strings, and the age rating is better represented as a factor.

fixedAppstoreGamesDF <- appstoreGamesDF %>%
  mutate(OriginalReleaseDate = as.Date(OriginalReleaseDate, "%d/%m/%Y")) %>%
  mutate(CurrentVersionReleaseDate = as.Date(CurrentVersionReleaseDate, "%d/%m/%Y")) %>%
  mutate(AgeRating = factor(AgeRating, levels=c('4+','9+', '12+', '17+')))
## Warning: The `printer` argument is deprecated as of rlang 0.3.0.
## This warning is displayed once per session.
appstoreGamesDF <- fixedAppstoreGamesDF
datatable(appstoreGamesDF %>% select(-URL, -ID, -Subtitle, -IconURL, -Description))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
summary(appstoreGamesDF)
##      URL                  ID                 Name          
##  Length:17007       Min.   : 284921427   Length:17007      
##  Class :character   1st Qu.: 899654330   Class :character  
##  Mode  :character   Median :1112286228   Mode  :character  
##                     Mean   :1059613815                     
##                     3rd Qu.:1286982837                     
##                     Max.   :1475076711                     
##                                                            
##    Subtitle           IconURL          AverageUserRating UserRatingCount  
##  Length:17007       Length:17007       Min.   :1.000     Min.   :      5  
##  Class :character   Class :character   1st Qu.:3.500     1st Qu.:     12  
##  Mode  :character   Mode  :character   Median :4.500     Median :     46  
##                                        Mean   :4.061     Mean   :   3306  
##                                        3rd Qu.:4.500     3rd Qu.:    309  
##                                        Max.   :5.000     Max.   :3032734  
##                                        NA's   :9446      NA's   :9446     
##      Price          In-appPurchases    Description       
##  Min.   :  0.0000   Length:17007       Length:17007      
##  1st Qu.:  0.0000   Class :character   Class :character  
##  Median :  0.0000   Mode  :character   Mode  :character  
##  Mean   :  0.8134                                        
##  3rd Qu.:  0.0000                                        
##  Max.   :179.9900                                        
##  NA's   :24                                              
##   Developer         AgeRating    Languages              Size           
##  Length:17007       4+ :11806   Length:17007       Min.   :     51328  
##  Class :character   9+ : 2481   Class :character   1st Qu.:  22950144  
##  Mode  :character   12+: 2055   Mode  :character   Median :  56768954  
##                     17+:  665                      Mean   : 115706430  
##                                                    3rd Qu.: 133027072  
##                                                    Max.   :4005591040  
##                                                    NA's   :1           
##  PrimaryGenre          Genres          OriginalReleaseDate 
##  Length:17007       Length:17007       Min.   :2008-07-11  
##  Class :character   Class :character   1st Qu.:2014-09-23  
##  Mode  :character   Mode  :character   Median :2016-07-09  
##                                        Mean   :2016-03-04  
##                                        3rd Qu.:2017-12-07  
##                                        Max.   :2019-10-26  
##                                                            
##  CurrentVersionReleaseDate
##  Min.   :2008-08-01       
##  1st Qu.:2016-04-17       
##  Median :2017-07-24       
##  Mean   :2017-04-26       
##  3rd Qu.:2018-11-19       
##  Max.   :2019-10-26       
## 

Univariate Plots

Right now I have no hypotheses to check, but let's create some plots to see the current state of the games released on the App Store.

First, the number of games released each year. The plot shows that the number of games released increased up until 2016, while 2017 and 2018 saw fewer releases. Since 2019 is not yet over, it may still catch up to the previous years.

appstoreGamesDF %>%
  select(OriginalReleaseDate) %>%
  mutate(OriginalReleaseYear = format(OriginalReleaseDate, "%Y")) %>%
  group_by(OriginalReleaseYear) %>%
  summarise(count = n()) %>%
    ggplot(aes(x=OriginalReleaseYear, y=count)) +
    geom_col() +
    geom_text(aes(label=count), vjust=-0.25) + 
    ylab("Number of games released") +
    xlab("Year of Release") +
    theme_minimal()

appstoreGamesDF %>%
  select(CurrentVersionReleaseDate) %>%
  mutate(CurrentVersionRelease = format(CurrentVersionReleaseDate, "%Y")) %>%
  group_by(CurrentVersionRelease) %>%
  summarise(count = n()) %>%
    ggplot(aes(x=CurrentVersionRelease, y=count)) +
    geom_col() +
    geom_text(aes(label=count), vjust=-0.25) + 
    scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
    theme_minimal()

unique(appstoreGamesDF$AverageUserRating)
##  [1] 4.0 3.5 3.0 2.5  NA 2.0 4.5 1.5 5.0 1.0
appstoreGamesDF %>%
  select(AverageUserRating) %>%
  filter(!is.na(AverageUserRating)) %>%
  group_by(AverageUserRating) %>%
  summarise(count = n()) %>%
    ggplot(aes(x=AverageUserRating, y=count)) +
    geom_col() +
    geom_text(aes(label=count), vjust=-0.25) +
    scale_x_continuous(breaks = seq(1,5,by=0.5)) +
    scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
    theme_minimal()

appstoreGamesDF %>%
  select(AgeRating) %>% 
  arrange(AgeRating) %>%
  group_by(AgeRating) %>%
  summarise(count = n()) %>%
    ggplot(aes(x=AgeRating, y=count)) +
    geom_col() +
    geom_text(aes(label=count), vjust=-0.25) + 
    theme_minimal()

appstoreGamesDF %>%
  select(ID, Languages) %>%
  separate_rows(Languages, sep=",\\s*") %>%  # split on commas, absorbing any whitespace that follows
  drop_na(Languages) %>%
  group_by(Languages) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(20) %>%
    ggplot(aes(x=reorder(Languages,desc(count)), y=count)) +
    geom_col() +
    geom_text(aes(label=count), vjust=-0.25, size=3.5) +
    scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
    theme_minimal()
## Warning: `list_len()` is deprecated as of rlang 0.2.0.
## Please use `new_list()` instead.
## This warning is displayed once per session.
## Selecting by count

appstoreGamesDF %>%
  select(ID, Genres) %>%
  separate_rows(Genres, sep=",\\s*") %>%  # split on commas, absorbing any whitespace that follows
  drop_na(Genres) %>%
  group_by(Genres) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(20) %>%
    ggplot(aes(x=reorder(Genres,desc(count)), y=count)) +
    geom_col() +
    geom_text(aes(label=count), vjust=-0.25, size=3.5) +
    scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle=90,vjust= 0.2,hjust=1))
## Selecting by count

appstoreGamesDF %>%
  select(Size) %>%
  filter(!is.na(Size))%>%
  arrange(Size) %>%
    ggplot(aes(x=Size)) +
    geom_histogram(bins=30) +
    theme_minimal()

appstoreGamesDF %>%
  select(UserRatingCount) %>%
  filter(!is.na(UserRatingCount))%>%
  filter(UserRatingCount>=10000)%>%
  arrange(UserRatingCount) %>%
    ggplot(aes(x=UserRatingCount)) +
    geom_histogram(bins=10) +
    theme_minimal()

appstoreGamesDF %>%
  select(Price) %>%
  filter(!is.na(Price)) %>%
  # filter(Price > 0) %>%  # uncomment to exclude free games
    ggplot(aes(x=Price))+
    geom_histogram(bins=100)+
    theme_minimal()

Multivariate Plots

Questions

The columns most relevant to these questions are:

  • Average User Rating: Rounded to nearest .5, requires at least 5 ratings
  • User Rating Count: Number of ratings internationally, null means it is below 5
  • Price: Price in USD
  • In-app Purchases: Prices of available in-app purchases
  • Developer: App developer
  • Age Rating: Either 4+, 9+, 12+ or 17+
  • Languages: ISO2A language codes
  • Size: Size of the app in bytes
  • Primary Genre: Main genre
  • Genres: Genres of the app
  • Original Release Date: When it was released
  • Current Version Release Date: When it was last updated
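
The column descriptions above imply two small conversions that make later plots easier to read: a flag for whether a game has any rating at all (a missing rating count means fewer than 5 ratings), and the size in mebibytes instead of bytes. The sketch below shows both on a made-up toy table; `toyGames`, `HasRating` and `SizeMiB` are illustrative names that do not exist in the dataset, and in the notebook the same `mutate()` would presumably be applied to `appstoreGamesDF`.

```r
library(dplyr)

# Toy rows with made-up values, standing in for appstoreGamesDF.
toyGames <- tibble(
  Name              = c("Game A", "Game B", "Game C"),
  AverageUserRating = c(4.5, NA, 3.0),
  UserRatingCount   = c(120, NA, 5),
  Size              = c(52428800, 1048576, 209715200)  # bytes
)

derived <- toyGames %>%
  mutate(
    HasRating = !is.na(UserRatingCount),  # NA means the game has fewer than 5 ratings
    SizeMiB   = Size / 1024^2             # bytes to mebibytes
  )

print(derived)
```

Keeping the original `Size` column alongside the converted one means no information is lost if a different unit is preferred later.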

Unfortunately, there is no information regarding the revenue these games make. We can only assume that any user who rates a non-free game has bought it at least once. Under that assumption, a game's price times its number of user ratings gives a crude model of how much money it has made compared to others. Of course, this does not account for games with in-app purchases, which are not only the most common type of game on the App Store but also, according to the news, the games that usually make the most money in mobile gaming.

With this crude model, we can examine how most variables relate to the revenue of a game: e.g., the number of languages, a specific language, the genres, the release date, the age rating, the app size, and maybe others.
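
A minimal sketch of this crude revenue model, assuming that every rating corresponds to at least one purchase at the current price. The table `toyGames` and the column `EstimatedRevenue` are hypothetical names with made-up values; in the notebook the same steps would presumably run on `appstoreGamesDF`. The result is at best a rough lower bound for ranking paid games against each other, not an actual revenue figure.

```r
library(dplyr)

# Toy rows with made-up values, standing in for appstoreGamesDF.
toyGames <- tibble(
  Name            = c("Game A", "Game B", "Game C", "Game D"),
  Price           = c(0, 2.99, 4.99, 0.99),
  UserRatingCount = c(5000, 100, 20, NA)
)

revenueProxy <- toyGames %>%
  # Free games and unrated games carry no information under this assumption.
  filter(Price > 0, !is.na(UserRatingCount)) %>%
  # Assumption: one purchase per rating, at the current price.
  mutate(EstimatedRevenue = Price * UserRatingCount) %>%
  arrange(desc(EstimatedRevenue))

print(revenueProxy)
```

Free games are dropped here rather than scored as zero, since their income (if any) comes from in-app purchases that this model cannot see.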

Is there a correlation between age rating and the languages available?

appstoreGamesDF %>%
  select(ID, AgeRating, Languages) %>%
  separate_rows(Languages, sep=",") %>%
  drop_na(Languages) %>%
  group_by(ID, AgeRating) %>%
  summarise(numberOfLanguages = n()) %>%  # one row per game: number of languages it supports
    ggplot(aes(x=AgeRating, y=numberOfLanguages)) +
    geom_boxplot() +
    geom_jitter(width = 0.3) +
    scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
    coord_cartesian(ylim=c(0,90)) +
    theme_minimal()

Is there a correlation between age rating and genre?

appstoreGamesDF %>%
  select(ID, AgeRating, Genres) %>%
  separate_rows(Genres, sep=",") %>%
  drop_na(Genres) %>%
  group_by(ID, AgeRating) %>%
  summarise(numberOfGenres = n()) %>%  # one row per game: number of genres it lists
    ggplot(aes(x=AgeRating, y=numberOfGenres)) +
    geom_boxplot() +
    geom_jitter(width = 0.3) +
    scale_y_continuous(expand = expand_scale(mult=c(0,0.05))) +
    theme_minimal()

Is there a correlation between age rating and average user rating?

To compare user ratings across age rating categories, I counted the games at each average user rating level within each age rating and then calculated the ratio of that count to the total number of rated games in that age rating. This is displayed in the stacked bar chart below.

# Get a vector containing all possible user rating levels in ascending order.
AverageUserRatingLevels <- appstoreGamesDF %>%
  drop_na(AverageUserRating) %>%
  arrange(AverageUserRating) %>%
  pull(AverageUserRating) %>%
  unique()

appstoreGamesDF %>%
  select(ID,AgeRating, AverageUserRating) %>%
  drop_na(AverageUserRating) %>%
  mutate(AverageUserRating = factor(AverageUserRating, levels = AverageUserRatingLevels)) %>%
  group_by(AgeRating, AverageUserRating) %>%
  summarise(count = n()) %>%
  mutate(freq = count / sum(count)) %>%
  ggplot(aes(x=reorder(AgeRating,desc(AgeRating)), y=freq, fill=AverageUserRating)) + 
  geom_col(position = position_stack(reverse = TRUE))  +
  scale_fill_brewer(palette = "RdYlGn") +
  geom_text(aes(label=count), size=4 ,position=position_stack(vjust = .5, reverse = TRUE)) + 
  theme_minimal() +
  xlab("Age Rating") +
  ylab("Proportion of games") +
  labs(fill="Average\nUser Rating") +
  coord_flip()

Is there a correlation between age rating and user rating count?
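
One possible way to start on this question is a per-category summary. The sketch below uses the median rather than the mean, because the summary earlier in the notebook shows that the rating count is heavily skewed (median 46 against a mean of 3306). The table `toyGames` holds made-up values and `ratingCountByAge` is a hypothetical name; on the real data the same pipeline would presumably start from `appstoreGamesDF`.

```r
library(dplyr)

# Toy rows with made-up values, standing in for appstoreGamesDF.
toyGames <- tibble(
  AgeRating = factor(
    c("4+", "4+", "9+", "12+", "17+", "17+", "4+"),
    levels = c("4+", "9+", "12+", "17+")
  ),
  UserRatingCount = c(10, 30, 500, NA, 8, 12, 50)
)

ratingCountByAge <- toyGames %>%
  filter(!is.na(UserRatingCount)) %>%  # unrated games have no count to compare
  group_by(AgeRating) %>%
  summarise(
    games             = n(),
    medianRatingCount = median(UserRatingCount)
  )

print(ratingCountByAge)
```

An age rating with no rated games simply does not appear in the result, which is worth keeping in mind when reading the table.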

Is there a correlation between price and age rating? … between price and the presence of in-app purchases? … between price and user rating count? … between price and language? … between price and genre? … between price and release date? … between price and app size?

Is there a correlation between original release date and current version release date? … between original release date and app size? … between original release date and genre? … between original release date and language?